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Abstract 

Background: Genome-wide expression profiles reflect the transcriptional networks specific to the given cell 
context. However, most statistical models try to estimate the average connectivity of the networks from a collection 
of gene expression data, and are unable to characterize the context-specific transcriptional regulations. We propose 
an approach for mining context-specific transcription networks from a large collection of gene expression fold-change 
profiles and composite gene-set information. 

Results: Using a composite gene-set analysis method, we combine the information of transcription factor binding 
sites. Gene Ontology or pathway gene sets and gene expression fold-change profiles for a variety of cell conditions. 
We then collected all the significant patterns and constructed a database of context-specific transcription networks for 
human [REGNET). As a result, context-specific roles of transcription factors as well as their functional targets are readily 
explored. To validate the approach, nine predicted targets of E2F1 in HeLa cells were tested using chromatin 
immunoprecipitation assay. Among them, five {Gadd45b, Dusp6, MII5, Bmp2 and E2f3) were successfully bound 
by E2F1. c-JUN and the EMT transcription networks were also validated from literature. 

Conclusions: REGNET is a useful tool for exploring the ternary relationships among the transcription factors, their 
functional targets and the corresponding cell conditions. It is able to provide useful clues for novel cell-specific 
transcriptional regulations. The REGNET database is available at http://mgrc.kribb.re.kr/regnet. 
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Background 

Cells alter the process of transcriptional regulation so as 
to adjust to or drive changes in the cellular conditions 
between different stages of the cell cycle, stem cell differ- 
entiation or cancer development. Such changes in the 
transcriptional networks between different cell condi- 
tions are represented in the transcriptome data that are 
obtained using microarrays or high-throughput sequen- 
cing. However, elucidating the complex transcription net- 
works (TNs) has been a daunting task in spite of the 
remarkable advances in both computational modeling and 
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high-throughput experimental technologies [1-8]. One 
main reason for the difficulty is the dynamic nature of the 
networks: Transcription factors (TFs) not only regulate 
different targets depending on the cell conditions, but 
their effects on these targets can also change. Therefore, 
statistical models that estimate the average connectivity 
of the networks from a large collection of transcriptome 
data may confer limited accuracy. 

Taking these issues into account, we have developed an 
approach for identifying the context-specific TNs from a 
large collection of gene expression fold-change profiles. 
Toward this end, we collected 2,482 paired (test/ control) 
human microarray datasets encompassing a variety of cell 
conditions. Using a composite gene-set analysis method 
(ADGO) [9,10], we combined the information of TP 
binding site (TFBS), Gene Ontology or KEGG pathway 
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gene sets, and the gene expression fold-change profiles, 
and thereby tried to address the fundamental but largely 
open question of which TFs regulate which functions or 
pathways under differing cell conditions? 

Since the majority of TFBSs are potentially false posi- 
tives, we applied the following filtering criteria to iden- 
tify reliable TNs as follows. (A) A TFBS gene set (genes 
that share a common TFBS in their promoters) is re- 
quired to have a significant overlap with a functional 
gene set (genes that share a common annotation in GO 
or KEGG). (B) The genes that overlap, as a whole, 
should exhibit significant expression changes. (C) Such 
changes should be observed across many microarray 
conditions (e.g. ten or more). If the above three criteria 
are satisfied between a TFBS gene set and a functional 
gene set, we say the corresponding TF is associated 
with the functional gene set (Figure la), and the asso- 
ciated pairs as well as the corresponding conditions are 
output as context-specific TNs. 

We collected all of the 12,149,291 significant triples 
(TF, functional gene set, condition; q-values for the 
two conditions (A) and (B) <0.05) in order to construct a 
context-specific TN database, dubbed REGNET. It is 
used for exploring the ternary relationships among the 
TFs, their functional targets and the corresponding cell 
conditions (Figure lb). Each TF or function (or path- 
way) name can be used as a query, and REGNET pro- 
vides corresponding context-specific TNs. Using this 
database, we analyzed TNs for two TFs, E2F1 and c- 
JUN. Many known target genes as well as correspond- 
ing microarray conditions were identified. REGNET 
also included many predictions for novel targets. For 
example, we selected nine candidate targets of E2F1 
predicted for HeLa cells and tested them using a ChIP 
assay. Among them, five {Gadd45b, Dusp6, MllS, 
Bmp2 and E2f3) were successfully bound by E2F1. We 
also validated the TNs for c-JUN and the 'epithelial to 
mesenchymal transition (EMT)' gene set from the lit- 
erature. REGNET is available at http://mgrc.kribb.re. 
kr/regnet. 



Methods 

Collection of data 

Human gene expression microarray datasets of the same 
platform 'HG-U133 Plus 2' were downloaded from 839 
Gene Expression Omnibus [11] series. In each series 
dataset, we manually identified the test/control sample 
groups to collect log fold-change profiles. Many datasets 
contained multiple test groups, and hence 2,482 fold- 
change profiles were collected in total. Probe values that 
correspond to the same gene were averaged to yield 
gene-based fold-change profiles (20,361 genes), human 
TFBS gene sets were obtained from MSigDB [12]. For func- 
tional gene sets, the three categories of GO and KEGG 
pathways were used. GO gene sets were obtained from a 
gene product association file downloaded from a ftp site 
(http://www.geneontology.org/gene-associations/). Both the 
electronic and non-electronic annotations were used to 
maximize the coverage. KEGG gene sets were downloaded 
from MSigDB. All the offspring terms are again included in 
their parent terms. All the gene-sets with not less than 10 
and not more than 500 genes are included in the analysis 
and the system. 

Identification of context-specific transcription networks 

We applied the following three filtering criteria to iden- 
tify reliable TNs: 

(A) Overlap significance: the significance of overlap for 
every pair of a TFBS gene set and functional gene set 
is assessed using the hypergeometric distribution. The 
genes that overlap, if significant (default: FDR q-value 
< =0.02), are regarded as candidate targets of the TF, 
and may be interpreted as the channel through which 
the TF regulates the function (or pathway). Such over- 
lapping sets constitute the candidate TNs to be further 
examined throughout gene expression datasets. 

(B) Expression significance: For each pair of TFBS- 
functional gene sets that overlap significantly, 
another filtering criterion using a fold-change ex- 
pression dataset is applied: The Z-statistic on the 
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Figure 1 Overview of REGNET. (a) Association of a TF and functional gene set. (b) Ternary relationships explored by REGNET. 
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gene expression fold-change values is computed for 
each of the TFBS set, functional set and their over- 
lapping set, respectively. If the FDR q-value of the 
overlapping set is less than or equal to 0.02 (default 
value) and the q-values for the two individual sets 
are larger than 0.05 (default value), we assume the 
corresponding TN is activated on the given micro- 
array condition. In other words, expression changes 
in the overlapping genes should be pronounced 
compared to the two individual sets [9]. 
(C) Number of conditions: Only TNs that satisfy the 
above two criteria (A) and (B) across k (default 
number =10) or more microarray conditions are 
regarded as reliable transcription patterns. 

If all these criteria are satisfied between a TFBS gene set 
and functional gene set, the TF, candidate targets as well as 
the corresponding conditions are output as context-specific 
TN. Among the overlap set identified through (A) ~ (C), we 
can further select individual candidate targets that show 
some high fold expression changes (e.g. ±1.5 of higher) 
across a number of conditions (e.g., more than 0.3 x/c con- 
ditions). All the threshold values including q-values and k 
can be changed by the user for more thorough exploration 
of the transcriptional regulations. Figure 2 summarizes the 
context-specific TNs queried by TF and functional gene 
set, respectively. 

Chromatin Immunoprecipitation (ChIP) assay for E2F1 

Chip assays for E2F1 in HeLa cells were performed follow- 
ing the protocol represented at http://genomics.ucdavis. 
edu/farnham. HeLa cells were crosslinked for 10 min by 
adding formaldehyde to a final concentration of 1% with 
mild agitation. Crosslinking was stopped by the addition of 
glycine to a final concentration of 125 mM, and cells 



were washed three times with ice-cold PBS prior to har- 
vesting by scraping of the plates. Chromatid were frag- 
mented for 20 min (10 sec on pulse and 20 sec off 
pulse) to produce fragments ~500 nt in size using the 
Bioruptor sonicator (Diagenode). Antibody (SC251X; 
Santa Cruz Biotech) to E2F1 was used to pull down tar- 
get chromatid from 1 x 10 cells. Genomic DNAs were 
isolated from proteinase K-treated (45°C for 1 h) sam- 
ples and purified. ChIP samples were tested by PCR 
using positive and negative control primer sets shown 
in Additional file 1: Table SI. The quantitative real-time 
PCR for Chip is described in Supplementary Material. 

Results 

REGNET database 

REGNET is an intuitive and easy to use database devel- 
oped for exploring context-specific TNs (http://mgrc. 
kribb.re.kr/regnet). REGNET accepts two kinds of queries, 
i.e. TF and function (or pathway, localization) names, 
respectively. It provides a webpage presenting all the 
TF names analyzed, so that investigators can simply 
choose the TF of their interest. Each TF name indicates 
a TFBS gene set to which the TF can bind. Because a 
TF can have different TFBSs, many of the TF names 
are represented multiple times. Among them, '(TF 
name)_air is linked to the TNs that can be captured 
only by taking the union of all the TFBS sets of their 
common TF. For example, AHR all links to the TNs 
that can be captured only by using the union of the four 
different TFBS sets (V$AHRARNT_01, V$AHRARNT_02, 
V$AHR_01 and V$AHR_Q5). 

Each TF name is linked to a list of functional gene sets 
associated with the TF. Each line on the list is then 
linked to the corresponding target genes, descriptions for 
the TFBS and functional gene sets and the microarray 




Figure 2 Context-specific transcription networl<s for the query of a TF or functional gene set. (a) For the query of a TF, multiple functiona 

gene sets are associated. Q Cj Q represent the microarray conditions of the fold-change profiles for which the genes in the overlap parts 

significantly altered their expressions. The collection of the genes in the overlap parts are used for a global assessment (b) For the query of a 
functional gene set, multiple TF's are associated. 
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conditions under which the target genes exhibit significant 
changes in expression. The 'number of conditions' leads 
to a table of the corresponding test/control microarray 
conditions. At the bottom of the table, a heat map of the 
log fold-change values of the target genes is displayed, 
summarizing the information for the associated pair of the 
TF and functional gene set (Figure 3). From this heat map, 
investigators can easily identify candidate targets that ex- 
hibit large expression changes across a number of condi- 
tions. The fold-change values of the TF itself are also 
provided on the right-end column of the heat map, for the 
activity of certain TFs can be usefully inferred from their 
transcript levels. 

When querying for a functional gene set, a keyword is 
required for searching. For example, if the investigators 
want to identify TNs related to the 'epithelial to mesenchy- 
mal transition' process, entering 'mesen' is sufficient. Then, 
the names of functional gene sets that contain 'mesen' 
are all displayed. From among them, investigators can 
choose the exact gene set of their interest, and then all 
the TFs associated with that functional gene set are listed. 



As in the case of the TF query, each line is linked to de- 
tailed information on the target genes, TFBS sets and the 
queried gene set as well as the corresponding microarray 
conditions. More detailed information on the REGNET 
database may be seen from our web site. 

Analysis results for E2F1 

We used the human TFBS gene sets from the Molecular 
Signatures Database (MSigDB) [12]. The MSigDB pro- 
vides five E2F1 TFBS sets. The predicted results for E2F1 
are accessible by choosing 'E' in the 'TF-based browsing' 
webpage. As expected, many cell-cycle related func- 
tional gene sets were found to be related to most of the 
TFBS gene sets. For example, the 'DNA replication 
(GO:0006260)' and 'Gl/S transition of mitotic cell 
cycle (GO:0000082)' are associated with four of the five 
TFBS sets while the 'cell cycle (KEGG)' and 'cell cycle 
checkpoint (GO:0000075)' are associated with three. 
For each gene set, many conditions likely to activate cell 
cycle progression were observed, e.g. iPS or diverse cancer 
cell conditions. 
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Figure 3 Condition table and heat maps for transcription networlcs of E2F1. (a) The condition table and heat map for the association 
between E2F1 and the gene set 'transcription activator activity (GO:0016563)'. Several targets exhibit clear activations across a number of virus 
treated HeLa cells. Seven genes, Bmp2, E2f3, Epcl, Fosll, MII5, Myc and Nr4a3 are chosen for validation using a ChIP assay, (b) The heat map for 
the association between E2F1 and the gene set 'negative regulation of protein kinase activity (GO:0006469)'. A part of the 55 conditions that 
associate E2F1 and the GO gene set are shown. Among the targets, GaddASb and Dusp6 exhibit clear expression changes across a variety of 
microarray conditions. Both genes are tested using a ChIP assay. 
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For the union of all the E2F1 TFBS gene sets (E2Fl_all), 
the 'transcription activator activity (GO:0016563)' set is 
captured and several of the putative targets exhibit high 
expression fold-changes for a number of the HeLa cell 
conditions treated with vaccinia virus (GSE 11238): Bmp2, 
E2f3, Epcl, Fosll, Mils, Myc and Nr4a3 (Figure 3a). An- 
other target set, 'negative regulation of protein kinase 
activity (GO:0006469)' is found to be associated with 
the E2Fl_aU set through 55 conditions. Among the targets, 
in particular, Dusp6 and Gadd4Sb exhibit high expression 
fold-changes across dozens of heterogeneous conditions 
(Figure 3b). Although these 55 conditions do not include 
the HeLa cells, it was found that both genes display more 
than two-fold increases in their expression levels in most of 
the virus-treated HeLa conditions (GSE 11238). 

To validate the regulatory relationships, therefore, we 
performed a ChIP assay on HeLa cells, as described in 
Methods, for the nine candidate targets selected from 
the two associated sets 'transcription activator activity' 
and 'negative regulation of protein kinase activity'. To 
this end, we activated E2F1 via the CDK4-Rb-E2F1 path- 
way [13] as described in Additional file 1: Figure SI. We 
used Gapdh as a control. The ChIP results are shown in 
Figure 4. Among the targets tested, five exhibited posi- 
tive results: The binding of E2F1 to Gadd45b, Dusp6 
and Mils was pronounced with six or higher fold- 
changes compared to a control IgG, and the binding to 
Bmp2 and E2f3 exhibited approximate five and three- 
fold increases, respectively. 

Analysis results for c-JUN 

The MSigDB provides eleven c-JUN TFBS sets. The pre- 
dicted results for c-JUN are accessible by choosing 'J' in 
the TF-based browsing webpage. The first TFBS set 
named 'TGANTCA_V$AP1_C' contains 866 genes, and 
129 GO or KEGG gene sets are associated with this TFBS 
set as the result of satisfying the three filtering criteria 



(A) ~ (C). Among them, we identified two cell condi- 
tions for which c-JUN targets are most clearly activated. 
For example, the 'response to steroid hormone stimulus 
(GO:0048545)' gene set is associated with the c-JUN 
TFBS set for 56 microarray conditions. Among them, 
the HeLa cells treated with vaccinia virus (GSE 11238) 
clearly exhibits activation in five targets: Adm, F3, Fosll, 
116 and Thbsl (Additional file 1: Figure S2a). The latter 
four genes are known targets of c-JUN [14,15]. Thbsl is 
known to be repressed by c-JUN in rat embryo fibroblasts 
[16], but it is also known to be activated by c-JUN in hu- 
man hepatocarcinoma cell lines [15]. ADM is reported to 
induce the phosphorylation of c-JUN in glioblastoma cells 
[17], but is also a candidate target of c-JUN in our predic- 
tion. Because the known targets of c-JUN as well as Jun it- 
self display high expression fold changes, we infer that 
viral infection to HeLa cells strongly increases the activity 
of c-JUN. In the 'pathways in cancer (KEGG)' gene set, 
we found another group of known c-JUN targets, i.e. 116, 
Lambs, Lamc2, Mmpl and Mmp9 [14], that are strongly ac- 
tivated for the conditions of 'engineered human skin' (GSE 
17539) [18] and 'low dose treatment of 5-aza-2-deoxycyti- 
dine on an non-small cell lung cancer cell Une' (GSE 6695) 
[19] (Additional file 1: Figure S2b). If another TFBS set 
V$AP1FJ_Q2 is chosen, the well-known c-JUN interact- 
ing partner Fos as well as 116 are captured in the 'path- 
ways in cancer (KEGG)' gene set, and Fos exhibits a very 
strong co-expression pattern with Jun in many of the 
virus-treated HeLa cell conditions (Additional file 1: 
Figure S2c). 

Summarizing the results for all the c-JUN TFBS gene 
sets, in total 156 functional gene sets are associated with 
at least one of the eleven TFBS gene sets. Of them, 52 
functional gene sets (approximately 33%) are associated 
with two or more. This suggests that c-JUN binds to a 
variety of different TFBSs in order to ensure the regula- 
tion of its target pathways. For example, 'pathways in 




Gadd45b MII5 Dusp6 Bmp2 E2f3 Epc-1 Nr4a3 Myc Fosll Gapdh 

Figure 4 Cliromatin IP experimental results for E2F1. The nine candidate targets of E2F1 were chosen for validation by gene-specific ChIP. 
The fold change (FC) is the fold increase of the signal from E2F1 antibody (black)-enriched chromatin relative to a control IgG (open). The negative 
control Gapdh exhibited no enrichment 
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cancer (KEGG)' is associated with eight different c-JUN 
TFBS gene sets, highlighting the role of c-JUN as a key 
factor of cancer progression. The representative gene 
sets associated with three or more c-JUN TFBS sets are 
shown in Table 1. 

Analysis results for the EMT gene set 

We now illustrate how to analyze a functional gene set 
using REGNET. Entering the keyword 'mesen' from the 
'Keyword Search' page, nine similar functional gene sets 
were presented. Among them, we chose the first one, i.e. 
the 'epithelial to mesenchymal transition (GO:0001837)'. 
This set is associated with eight TPs (MAZ, F0X04, 
PITX2, SPl, NFAT, PAX4, ZEBl and LEFl). Except for 
MAZ and PAX4, each TF is reported to regulate EMT 
process. We then investigated the conditions under which 
each TF might regulate the EMT process. Many putative 
conditions that are likely to trigger EMT were observed. 
For example, PCDH24-expressing HCT116 cells shown to 
abolish tumor formation in vivo have -2000 differentially 
expressed genes. These expression changes have been 
confirmed to be quite similar to EMT via further prote- 
omics analysis using 2-DE/MS (GSE10650). Our pre- 
dictions on EMT suggest MAZ as a candidate TF to 
regulate EMT in the HCT116 cells. Among the candi- 
date targets of MAZ in the EMT gene set, Bmp7, 
Collal, Sfrp2, Wnt2 and WntSa exhibit two or higher 
expression fold-changes. Interestingly, four of the tar- 
gets except for Sfrp2 were also captured among the 
PAX4 results for the same cell condition. This indicates 
PAX4 and MAZ possibly cooperate on their four com- 
mon targets to regulate EMT in the HCT cells. For an- 
other example, the embroid body cells (GSE9196) the 
differentiation of which is known to accompany EMT 

Table 1 Functional gene sets associated with c-JUN 



[20] also exhibit clear activation in many of the MAZ 
targets (Additional file 1: Figure S3a). 

Among the conditions, HUVECs treated with TNF- 
alpha are captured by five of the eight TF results 
(Additional file 1: Figure S3b-e). By mapping each TF to 
the targets that exhibit two or higher fold-changes, the 
EMT transcription networks in HUVECs treated with 
TNF-alpha were constructed that comprised five TFs 
and six target genes (Figure 5). TNF-alpha is a key cyto- 
kine involved in inflammation and cancer, and is known 
to induce EMT [21]. Because HUVECs do not contain 
epithelial cells, such EMT signals may indicate 'endothe- 
lial' to mesenchymal transitions. 

Although each regulatory relationship in the networks 
needs to be validated, it is noteworthy that the predicted 
edges of the networks were evenly distributed among 
the five TFs. This implies that the EMT in HUVECs is 
processed by a tight cooperation of these regulators. 
Some of the reports for the predicted regulatory rela- 
tionships are provided in Additional file 1: Table S2. 

A global assessment of REGNET 

Because the predicted TNs in REGNET are context- 
specific (the targets or effects of a TF can change de- 
pending on the cell conditions), reliable gold-standard 
TNs for a global assessment are rarely available. For ex- 
ample, each ChlP-seq or ChlP-chip dataset is also ob- 
tained under a specific cell condition and may not serve 
as a general gold-standard set. Therefore, instead of con- 
sidering individual cell conditions, we merged all the 
context-specific targets of a TF for our prediction set 
(the union of the intersection parts in Figure 2a), and 
merged multiple ChlP-seq or ChlP-chip data for gold- 
standard targets of a TF. Among the prediction set, we 



sets* 



Functional gene set codes Functional gene set names #TFBS 

KEGG PATHWAYS IN CANCER (KEGG) Pathways in cancer 8 

GO:0000165 (GOBP) MAPKKK cascade 6 

GO:0043086 (GOBP) Negative regulation of catalytic activit/ 5 

GO:0031012 (GOCC) Extracellular matrix 4 

GO:0045859 (GOBP) Regulation of protein kinase activity 4 

KEGG MAPK SIGNALING PATHWAY (KEGG) MARK signaling pathway 4 

GO:0001 501 (GOBP) Skeletal system development 3 

GO:0001 525 (GOBP) Angiogenesis 3 

GO:0007409 (GOBP) Axonogenesis 3 

GO:0018193 (GOBP) Peptidyl-amino acid modification 3 

GO:0030334 (GOBP) Regulation of cell migration 3 

GO:0032270 (GOBP) Positive regulation of cellular protein metabolic process 3 

GO:0043055 (GOBP) Negative regulation of apoptosis 3 

GO:0048812 (GOBP) Neuron projection morphogenesis 3 



*The number of c-JUN TFBS gene sets associated with each functional gene set. 
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Figure 5 EMT transcription networks in HUVECs inferred from REGNET. The sun symbols indicate the corresponding TFs or TF-target 
relationships have strong supports from the literature. The circle symbols indicate weak supports from the literature. The evidences from 
the literature are all summarized in Additional file 1 : Table S2. The graph is depicted using the Graphviz software. 



further selected candidate targets that exhibited ±1.5 or 
higher fold expression changes across 0.3 x/c conditions. 
We vaHdated the predicted targets of E2F1, JUN and 
TP53. Because these TFs have muhiple known binding 
sites in MSigDB, we merged all the corresponding targets 
for each TF. The targets reported from ChIP data are 
downloaded from ChE A curated database [22] . Three, one 
and six ChIP datasets were available for E2F1, JUN and 
TP53, respectively and corresponding targets were merged 
for each TF. These targets with known TFBS were further 
selected for true positive gold standard sets. 

Among the 564 genes with an E2F1 binding site, 65 
were the predicted as context-specific targets. Among the 
270 true positive set, 39 were included in the 65 predicted 
set. This corresponds to specificity 0.9116 and sensitivity 
0.1444 with a significant p-value 0.0256 against a random 
prediction (hypergeometric distribution). The true positive 
set used here is obtained from a collection of high- 
throughput Chip datasets and is missing the three targets 
Dusp6, M115 and Gadd45b, whOe these targets were identi- 
fied from our ChIP assay. If we include these three in the 
positive set, we have an improved sensitivity 0.1538 with a 
significant p-value 0.0039. Unfortunately, JUN had only one 
Chip data in ChEA database. Based on this single ChIP 
dataset, we had specificity 0.7853, sensitivity 0.2593 but 
with a less significant p-value 0.1734. Lastiy, TP53 had six 
Chip datasets. Among the 426 genes with an P53 binding 



site, 142 were true positives and the 50 predicted targets 
included 25 true positives. This corresponds to specificity 
0.9120, sensitivity 0.1761 and a significant p-value 0.0071. 

In summary, REGNET provides highly specific pre- 
dictions for context-specific transcriptional regulations. 
Though less sensitive, they still contain a number of novel 
regulatory relationships as shown in our ChIP assay and 
the EMT networks. 

Assessment for extended GO and KEGG annotations 

Because many genes are still unannotated in GO and 
KEGG, transcriptional regulations for such genes are not 
covered by REGNET. To ameliorate the problem, we ex- 
tended GO and KEGG annotations using protein- 
protein interaction networks [23], and tested how the 
extended annotations affect the predictions in REGNET. 
The annotations are extended as follows: For each gene 
gi (protein), we identified the set of interacting proteins 
Li from the BioGrid database [24]. For each functional 
gene set (GO or KEGG annotation), say Fj, we assessed 
the significance of overlap between L, and Fy using 
hypergeometric distribution. If an overlap p-value < po, 
we assign the annotation Fj to the gene gi. 

In the original GO and KEGG annotations, 2,409 GO 
biological process terms (GO-BP), 369 GO cellular com- 
ponent terms (GO-CC), 186 GO molecular function 
terms (GO-MF) and 666 KEGG terms are used to 
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annotate genes 164,562, 26,270, 12,875 and 41,063 
times respectively. When a threshold po = 10£-4 is ap- 
plied, GO-BP, GO-CC, GO-MF and KEGG terms are 
used 233,878, 37,455, 23,370 and 54,142 times exhibiting 
overall 29.8% increase. Using these extended annotations, 
we regenerated significant patterns in REGNET and 
extracted context-specific targets of E2F1 and TP53, 
respectively. With the original annotations, 65 and 50 
context-specific targets were identified for E2F1 and 
TP53, but the extended annotations increased these 
numbers to 183 and 71, respectively. The extended pre- 
dictions, as expected, exhibited somewhat reduced spe- 
cificities 0.7313 and 0.8627, and increased sensitivities 
0.3852 and 0.2253 for E2F1 and TP53, respectively. 

Here, we illustrated how current annotations can be 
extended using protein interaction data and how such 
extension could affect the predictions in REGNET, but 
such extensions are not included in the current REGNET 
system. 

Discussion and conclusions 

For a given TF or a functional gene set, REGNET provides 
the associated TNs as well as the corresponding condi- 
tions. From the heat map provided for each of the associ- 
ated pairs of a TF and functional gene set, investigators 
can select putative targets that exhibit large expression 
changes across a number of different conditions. We note 
that the transcription networks inferred in our approach 
have multiple lines of evidences and a substantial portion 
of such predictions are highly likely true as validated by re- 
ports in the literature and the ChIP assay in the finding re- 
ported here. 

For a global validation, we pooled targets from multiple 
Chip datasets and pooled context-specific predictions for 
each of three TFs. As expected, predictions in REGNET 
were highly specific and less sensitive. For example, 
REGNET provides 65 context-specific targets of E2F1, 42 
of which were actually bound by the TF. More import- 
antly, REGNET suggests the cell conditions in which such 
regulation patterns occur. Despite the high specificity, pre- 
dictions by REGNET included many novel targets provid- 
ing useful clues for unraveling cellular processes. 

Because many microarray datasets have only a small 
number of samples, we applied a gene randomizing gene 
set analysis method [25,26] to assess the overall expres- 
sion changes in each gene set. Most gene-randomizing 
methods, however, suffer from increased false-positives 
[26]. To reduce false positives, we applied an auxiliary 
filter that requires a significant regulation pattern to be 
observed across a number of different conditions. 

Specifically, REGNET deploys a composite gene-set ana- 
lysis method [9,10] to combine three sources of genomic 
information: TFBS gene set, functional gene set and 
gene expression fold-change profile. As a result, three- 



dimensional information among the TFs, their functional 
targets and cell conditions are readily explored. Most other 
approaches including DAVID [27], the Molecular Concept 
Map [28] and ConceptGen [29] investigate the 'binary' rela- 
tionships between a TFBS gene set and another gene list. 
One may apply the overlap of gene-sets once more to in- 
vestigate such three-dimensional relationships, but the re- 
peated use of gene-set overlaps may cause a serious loss of 
power due to the use of threshold values [26,30]. 

Because the activity of a TF is estimated by the overall 
pattern of expression change in its target genes, REGNET 
is most useful for finding relatively strong transcriptional 
activities in which a TF regulates multiple target genes 
simultaneously. On the other hand, specific regulation 
of one or two targets that have only small expression 
changes might be missed in this approach. Another 
limitation is that by using the TFBS information instead 
of genome-wide ChIP data, which are mostly unavail- 
able, we only consider the sequence-specific transcrip- 
tional regulation. Furthermore, many gene functions are 
still unknown and remain unannotated in GO or KEGG. 
We here demonstrated a method to extend annotation 
gene-sets using protein interaction data and its effect on 
the performance of REGNET. More reliable extension 
of gene annotation may be possible by integrating di- 
verse types of genomic data and network information 
which may help identifying missing transcriptional asso- 
ciations in REGNET [31-33]. 

Over the last decade, more than a million gene expres- 
sion microarray datasets have been deposited in public 
databases [11,34,35], covering virtually all of the cell condi- 
tions of interest including a broad range of tissues, dis- 
eases, development and treatments. Most of them are 
made up of test and control samples, and their fold-change 
values provide information on the TNs specific to each 
condition. For this reason, we have collected the fold- 
change profiles for a diverse array of cell conditions in 
order to identify context-specific TNs. We have collected 
2,482 human expression fold-change profdes in our 
current database. As this number increases, it is expected 
that more specific TNs are newly identified. Moreover, de- 
veloping a similar system for other species would be valu- 
able for the purpose of comparative analysis. 
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